Kenneth Tay
Oct 9, 2018
Goal: Demonstrate that you know how to do data analysis in R
Minimum requirements:
vec <- c("a", "b", "c")
vec
## [1] "a" "b" "c"
vec[c(2,4)]
## [1] "b" NA
classes <- list(quarter = "Fall 2018/19",
             ID = c("STATS 32", "STATS 101", "STATS 200"),
             credits = 12)
classes$ID
## [1] "STATS 32"  "STATS 101" "STATS 200"
classes[["credits"]]
## [1] 12
A special type of list:
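The data frame is the usual example of such a list. A minimal sketch using the built-in mtcars data, checking that a data frame is stored as a list of equal-length columns (the variable names `same_column` and `column_lengths` are only for illustration):

```r
data(mtcars)

# A data frame is stored as a list whose elements are the columns.
is.list(mtcars)
typeof(mtcars)

# List-style access therefore works on columns, just as with classes$ID.
same_column <- identical(mtcars$mpg, mtcars[["mpg"]])
same_column

# Unlike a general list, every element has the same length (the number of rows).
column_lengths <- sapply(mtcars, length)
all(column_lengths == nrow(mtcars))
```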
data(mtcars)
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
str, summary
head, tail
names, dim, nrow, ncol
table
mean, median, sd, var
factor
I want all the rows such that the value of the cyl column is equal to 2:
vehicles[vehicles$cyl == 2, ]
df
##    A    B
## 1  1    a
## 2  2    b
## 3  3    c
## 4 NA    d
## 5 NA <NA>
df$A == 2
## [1] FALSE  TRUE FALSE    NA    NA
df[df$A == 2, ]
##       A    B
## 2     2    b
## NA   NA <NA>
## NA.1 NA <NA>
Fix 1: test that the value is not NA and is equal to 2
df[!is.na(df$A) & df$A == 2, ]
##   A B
## 2 2 b
Fix 2: use the which function
which(df$A == 2)
## [1] 2
df[which(df$A == 2), ]
##   A B
## 2 2 b
E.g. Take the mean of c(1,3,NA).
mean(c(1,3,NA))
## [1] NA
mean(c(1,3,NA), na.rm = TRUE)
## [1] 2
ggplot2 (and the + syntax)
“The simple graph has brought more information to the data analyst’s mind than any other device.” - John Tukey
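The dataset printed next has the same values as the mpg, wt and cyl columns of mtcars, under friendlier names. One way it could have been built is sketched here. This is an assumption: the original construction is not shown, and storing cylinders as a factor is a guess made so that later plots treat it as a category.

```r
data(mtcars)

# Hypothetical reconstruction of the three-column dataset used in the plots.
df <- data.frame(
    mpg       = mtcars$mpg,
    weight    = mtcars$wt,          # weight in 1000 lbs
    cylinders = factor(mtcars$cyl)  # assumption: treated as categorical
)

head(df, 3)
```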
##     mpg weight cylinders
## 1  21.0  2.620         6
## 2  21.0  2.875         6
## 3  22.8  2.320         4
## 4  21.4  3.215         6
## 5  18.7  3.440         8
## 6  18.1  3.460         6
## 7  14.3  3.570         8
## 8  24.4  3.190         4
## 9  22.8  3.150         4
## 10 19.2  3.440         6
## 11 17.8  3.440         6
## 12 16.4  4.070         8
## 13 17.3  3.730         8
## 14 15.2  3.780         8
## 15 10.4  5.250         8
## 16 10.4  5.424         8
## 17 14.7  5.345         8
## 18 32.4  2.200         4
## 19 30.4  1.615         4
## 20 33.9  1.835         4
## 21 21.5  2.465         4
## 22 15.5  3.520         8
## 23 15.2  3.435         8
## 24 13.3  3.840         8
## 25 19.2  3.845         8
## 26 27.3  1.935         4
## 27 26.0  2.140         4
## 28 30.4  1.513         4
## 29 15.8  3.170         8
## 30 19.7  2.770         6
## 31 15.0  3.570         8
## 32 21.4  2.780         4
What is the distribution of cylinders in my dataset?
What is the distribution of miles per gallon in my dataset?
What is the relationship between mpg and weight?
What is the relationship between mpg and time?
Not so good… 
Easier to see the trend 
For each value of cylinder, what is the distribution of mpg like?
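Each of the questions about this dataset suggests a different type of plot. A rough sketch of one reasonable choice per question, assuming ggplot2 is installed and that df is rebuilt from mtcars as below (the plot object names are illustrative, and other plot types could answer the same questions):

```r
library(ggplot2)

data(mtcars)
df <- data.frame(mpg = mtcars$mpg, weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

# Distribution of a categorical variable: bar chart of counts.
p_cyl <- ggplot(df, aes(x = cylinders)) + geom_bar()

# Distribution of a numeric variable: histogram.
p_mpg <- ggplot(df, aes(x = mpg)) + geom_histogram(bins = 10)

# Relationship between two numeric variables: scatterplot.
p_scatter <- ggplot(df, aes(x = weight, y = mpg)) + geom_point()

# Distribution of a numeric variable within each category: boxplots.
p_box <- ggplot(df, aes(x = cylinders, y = mpg)) + geom_boxplot()
```

Printing any of these objects (for example `p_box`) draws the plot.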
I have father-son pairs. For each pair, I record their height and weight, as well as their ethnicities. I want to study the relationship between characteristics of the father and that of the son. What plots could help me?
ggplot2
ggplot2 package
ggplot2 reference manual
Data: Dataset we are using for the plot
##     mpg weight cylinders
## 1  21.0  2.620         6
## 2  21.0  2.875         6
## 3  22.8  2.320         4
## 4  21.4  3.215         6
## 5  18.7  3.440         8
## 6  18.1  3.460         6
## 7  14.3  3.570         8
## 8  24.4  3.190         4
## 9  22.8  3.150         4
## 10 19.2  3.440         6
Geometries: Visual elements used for our data
Geom: point
Aesthetics: Defines the data columns which affect various aspects of the geom
3 different aesthetics:
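For instance, a point geom can take its x position, y position and colour from three data columns. A minimal sketch, assuming ggplot2 is installed and df is rebuilt from mtcars; the choice of these three aesthetics is illustrative only:

```r
library(ggplot2)

data(mtcars)
df <- data.frame(mpg = mtcars$mpg, weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

# Each aesthetic inside aes() ties one visual property of the points
# to one column of the data.
p_aes <- ggplot(data = df) +
    geom_point(mapping = aes(x = weight, y = mpg, color = cylinders))

# layer_data() shows the values ggplot2 computed for each aesthetic.
head(layer_data(p_aes)[, c("x", "y", "colour")])
```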
We can have more than one layer in a graphic.
[Figure: graphic = layer + layer]
Each layer contains (essentially):
ggplot2 code: take 1
Making use of ggplot’s sensible defaults:
ggplot() +
    geom_boxplot(data = df, mapping = aes(x = cylinders, y = mpg)) +
    geom_point(data = df, mapping = aes(x = cylinders, y = mpg))
ggplot2 code: take 2
Using jitter to avoid “overplotting”:
ggplot() +
    geom_boxplot(data = df, mapping = aes(x = cylinders, y = mpg)) +
    geom_point(data = df, mapping = aes(x = cylinders, y = mpg), 
               position = "jitter")
ggplot2 code: take 3
When layers share attributes, we only have to type them once:
ggplot(data = df, mapping = aes(x = cylinders, y = mpg)) +
    geom_boxplot() +
    geom_point(position = "jitter")
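Attributes set in ggplot() act as defaults, and a single layer can still add to them. A small sketch of that idea, assuming ggplot2 is installed and df is rebuilt from mtcars; colouring the points is an illustrative addition, not part of the original code:

```r
library(ggplot2)

data(mtcars)
df <- data.frame(mpg = mtcars$mpg, weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

p_layers <- ggplot(data = df, mapping = aes(x = cylinders, y = mpg)) +
    geom_boxplot() +                              # inherits data, x and y
    geom_point(mapping = aes(color = cylinders),  # adds colour for this layer only
               position = "jitter")

# The boxplot layer summarises each group; the point layer keeps every row.
nrow(layer_data(p_layers, 1))
nrow(layer_data(p_layers, 2))
```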
 Optional material
One graphic contains:
Behind the scenes, R may need to do some transformation on the dataset to make the graphic.
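A bar chart is one case of such a transformation: geom_bar() counts the rows in each category before drawing anything. A minimal sketch, assuming ggplot2 is installed and df is rebuilt from mtcars, comparing the counts ggplot2 computes with counts from table():

```r
library(ggplot2)

data(mtcars)
df <- data.frame(mpg = mtcars$mpg, weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

p_bar <- ggplot(df, aes(x = cylinders)) + geom_bar()

# Counts computed by ggplot2 for the bars.
counts_from_plot <- layer_data(p_bar)$count

# The same counts computed by hand.
counts_by_hand <- as.vector(table(df$cylinders))

all(counts_from_plot == counts_by_hand)
```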
Sometimes we need to tweak the position of the geometric elements because they obscure each other.
Only 9 data points?? 
Much better 
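Overplotting of this kind can be reproduced with any two variables that take few distinct values. A sketch using the cyl and gear columns of mtcars, which is not necessarily the data behind the figures above, and assuming ggplot2 is installed:

```r
library(ggplot2)

data(mtcars)

# Many rows share the same (cyl, gear) pair, so points sit on top of each other.
p_overplot <- ggplot(mtcars, aes(x = cyl, y = gear)) + geom_point()
distinct_positions <- nrow(unique(mtcars[, c("cyl", "gear")]))
distinct_positions < nrow(mtcars)

# Jitter adds a small random offset to each point so that all rows are visible.
# width and height control the size of the offset; seed makes it reproducible.
p_jitter <- ggplot(mtcars, aes(x = cyl, y = gear)) +
    geom_point(position = position_jitter(width = 0.2, height = 0.2, seed = 1))
```

Jittering changes where points are drawn, not the data, so it is best kept small relative to the spacing between values.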
Default colors 
Manually chosen colors 
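Colours can be chosen by hand with a manual scale. A minimal sketch, assuming ggplot2 is installed and df is rebuilt from mtcars; the three colour names are arbitrary picks rather than the ones in the figure:

```r
library(ggplot2)

data(mtcars)
df <- data.frame(mpg = mtcars$mpg, weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

# Map each level of cylinders to a colour of our choosing.
p_manual <- ggplot(df, aes(x = weight, y = mpg, color = cylinders)) +
    geom_point() +
    scale_color_manual(values = c("4" = "steelblue",
                                  "6" = "darkorange",
                                  "8" = "firebrick"))
```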
Default axis limits 
Manually chosen axis limits 
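Axis limits can be set with xlim() and ylim(). A short sketch, assuming ggplot2 is installed and df is rebuilt from mtcars; the limits 0 to 6 and 0 to 40 are illustrative values that happen to contain all the data:

```r
library(ggplot2)

data(mtcars)
df <- data.frame(mpg = mtcars$mpg, weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

# Start both axes at zero instead of at the smallest observed value.
p_limits <- ggplot(df, aes(x = weight, y = mpg)) +
    geom_point() +
    xlim(0, 6) +
    ylim(0, 40)
```

Note that xlim() and ylim() drop observations outside the limits; coord_cartesian(xlim = ..., ylim = ...) zooms without dropping data.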
Refers to all non-data ink
ggplot2’s default theme 
Minimal theme 
Classic theme 
Dark theme 
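Switching themes is a matter of adding one more term to the plot. A minimal sketch of the four themes named above, assuming ggplot2 is installed and df is rebuilt from mtcars:

```r
library(ggplot2)

data(mtcars)
df <- data.frame(mpg = mtcars$mpg, weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

p_base <- ggplot(df, aes(x = weight, y = mpg)) + geom_point()

p_default <- p_base                    # default theme (theme_gray)
p_minimal <- p_base + theme_minimal()  # no background, light grid lines
p_classic <- p_base + theme_classic()  # axis lines, no grid
p_dark    <- p_base + theme_dark()     # dark panel background
```

The data and geoms are identical in all four plots; only the non-data ink changes.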
rgb(0,0,1), rgb(1,0,0), rgb(0,0,0), rgb(1,1,1)
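The rgb() function takes red, green and blue intensities between 0 and 1 and returns a hexadecimal colour string. A quick check of the four calls above in base R (the variable names are only for illustration):

```r
blue  <- rgb(0, 0, 1)  # only the blue channel is on
red   <- rgb(1, 0, 0)  # only the red channel is on
black <- rgb(0, 0, 0)  # all channels off
white <- rgb(1, 1, 1)  # all channels fully on

c(blue, red, black, white)
## [1] "#0000FF" "#FF0000" "#000000" "#FFFFFF"
```

These strings can be used anywhere R accepts a colour name.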